Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make maximum depth configurable via CLI option #3063

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mgree
Copy link

@mgree mgree commented Mar 11, 2024

Addresses #2846. (I didn't see any contributor guidelines/linting info... please let me know if you'd like me to change anything!)

This PR adds state to two places to track maximum parser depth: in jq_state and in jv_parser.

It might seem like jv_parser is enough on its own, but many other parts of the code parser values holding only a jq_state. It's not clear to me that these aren't mostly/entirely raw parses (which wouldn't need to track depth), but it seemed easiest to track the state thoroughly.

There are a few helper functions at the end of jv_parse.c that have no handle at all---neither a jq_state nor a jv_parser. I had these just use the DEFAULT_MAX_PARSING_DEPTH. They are mostly for fuzzing, it seems, but jv_parse gets called in main during option processing. It wouldn't be hard to have this use the current parser_maxdepth.

@wader
Copy link
Member

wader commented Mar 12, 2024

Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).

@wader
Copy link
Member

wader commented Mar 12, 2024

Should add CLI documentation to manual.yml

src/execute.c Outdated
void jq_set_parser_maxdepth(jq_state *jq, int parser_maxdepth)
{
jq->parser_maxdepth = parser_maxdepth;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing new lines?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, added it on in 586e96d! (Would you want a PR that adds this as a CI lint?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe would be good, i like linters. Let's see what the other maintainers think

@mgree
Copy link
Author

mgree commented Mar 12, 2024

Should add CLI documentation to manual.yml

In 3b0d520.

Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).

Thanks!

I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector. It might make sense to recommend the --depth argument in the error message... should I add that in to the message?

@@ -269,6 +269,13 @@ sections:
available in the program and has a string whose contents are to the texts
in the file named `bar`.

* `--depth n`:

This option sets the maximum parsing depth (of objects and lists) to `n`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe array instead of lists?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c4e6358.

@wader
Copy link
Member

wader commented Mar 12, 2024

Should add CLI documentation to manual.yml

In 3b0d520.

👍

Implementation looks good to me. There was some discussion in #2846 about what the default should be, unbounded or 256 (or something else?).

Thanks!

I would recommend leaving the default maximum parsing depth at 256, as changing the default depth to unbounded opens up a DoS vector.

Ok, let's see what @nicowilliams thinks

It might make sense to recommend the --depth argument in the error message... should I add that in to the message?

Yeap sounds like a good idea

@wader
Copy link
Member

wader commented Mar 12, 2024

You probably want to do make jq.1.prebuilt tests/man.test to regenerate the man page and man tests

@wader
Copy link
Member

wader commented Mar 12, 2024

BTW https://greenberg.science/ layout is 🥇😄

@mgree
Copy link
Author

mgree commented Mar 20, 2024

Do y'all have a sense of a timeline on this merging (and the next release)? (Not asking to rush you, just for planning on our end.)

@wader
Copy link
Member

wader commented Mar 20, 2024

@mgree Hey, i pinged the other maintainers on discord yesterday about this PR and some other ones so hopefully they review it soon. As it adds a new cli argument it's probably good that at least 2 others approve it.

Copy link
Member

@wader wader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Only non-blocking thing for me is if the default should be unbounded instead?

mgree added a commit to MaterializeInc/materialize that referenced this pull request Mar 20, 2024
Some basic scripts for analyzing queries from `bin/mzexplore extract`.

These scripts will fail on extracted JSON ASTs without a patched version of jq <jqlang/jq#3063>.

### Motivation

  * This PR adds a known-desirable feature.

  Related to #17592 and #25337.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [ ] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- <!-- Add release notes here or explicitly state that there are no
user-facing behavior changes. -->
@wader
Copy link
Member

wader commented Mar 28, 2024

Looking at the issue @mgree ran into in MaterializeInc/materialize#25972 with the 256 depth limit i would say it's a good argument for increasing it or just make it unbounded by default.

@mgree
Copy link
Author

mgree commented Mar 29, 2024

For Materialize's use case, we would want no depth limit : we're looking at non-adversarial inputs that we know to be deeply nested (ASTs with a particularly nested JSON representation). But we're happy to use --depth 0 on all of our invocations---NBD.

A quick experiment shows it's probably pretty safe to have no limit by default from a time performance perspective, but maybe not a memory or DoS perspective. Objects nested 1024 * 1024 deep make a file of 6MB but jq allocates 0.5GB before segfaulting; lists nested 1024 * 1024 deep make a file of 2MB but jq allocates 0.3GB before segfaulting. Scaling seems to be linear: if make an object of depth 1024 * 1024 * 4, the 25MB file leads jq to allocate just under 2GB, segfaulting after 1.7s.

At a minimum, I'd guess these segfaults should be caught and turned into proper aborts. At worst, maybe they're bugs? (Maybe with printing? Because a lot of work seems to happen before things crash...)

It's might be kindest to users to have a maximum depth by default---that way users don't discover the hard way that adversarial files can use lots of memory and/or crash jq. Not that I would use jq on nasty files willy nilly, but... people do all kinds of things. 😁

Here's a script for making deep JSON objects and lists (EDIT 2024-04-03: noticed a bug in deep_list.sh 🙄):

#!/bin/sh
# deep_obj.sh

jot -b '{"a":' -s '' $1 >obj$1.json
echo 0 >>obj$1.json
jot -b '}' -s '' $1 >>obj$1.json

#!/bin/sh
# deep_list.sh

jot -b '[' -s '' $1 >list$1.json
echo 0 >>list$1.json
jot -b ']' -s '' $1 >>list$1.json

And here's a run on my not fancy macOS laptop (EDIT 2024-04-03: re-ran with the bug fixed, didn't change overall results):

$ for x in 257 1024 $((1024 * 64)) $((1024 * 256)) $((1024 * 1024))
do
  ./deep_list.sh $x
  ./deep_obj.sh $x
  echo \# $x
  echo \#\# list
  command time -l jq --depth 0 . list$x.json >/dev/null
  echo \#\# obj
  command time -l jq --depth 0 . obj$x.json >/dev/null
done
# 257
## list
        0.01 real         0.00 user         0.00 sys
             2101248  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 722  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                   7  involuntary context switches
            76583002  instructions retired
            35534464  cycles elapsed
             1290240  peak memory footprint
## obj
        0.01 real         0.00 user         0.00 sys
             2007040  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 701  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  14  involuntary context switches
            77345793  instructions retired
            35134742  cycles elapsed
             1191936  peak memory footprint
# 1024
## list
        0.01 real         0.00 user         0.00 sys
             2322432  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 776  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  20  involuntary context switches
            77963756  instructions retired
            37622876  cycles elapsed
             1273856  peak memory footprint
## obj
        0.01 real         0.00 user         0.00 sys
             2375680  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                 789  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  12  involuntary context switches
            79654892  instructions retired
            36269047  cycles elapsed
             1228800  peak memory footprint
# 65536
## list
        0.03 real         0.02 user         0.01 sys
            24293376  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                6139  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  37  involuntary context switches
           199628394  instructions retired
           114804163  cycles elapsed
            23146496  peak memory footprint
## obj
        0.05 real         0.03 user         0.01 sys
            36016128  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                9003  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  32  involuntary context switches
           289287462  instructions retired
           161012325  cycles elapsed
            32870400  peak memory footprint
# 262144
## list
time: command terminated abnormally
        0.07 real         0.05 user         0.02 sys
            86921216  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               21431  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  63  involuntary context switches
           387970009  instructions retired
           252783614  cycles elapsed
            85774336  peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
        0.13 real         0.09 user         0.04 sys
           138883072  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               35396  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  63  involuntary context switches
           712225649  instructions retired
           442663148  cycles elapsed
           126955520  peak memory footprint
Segmentation fault: 11
# 1048576
## list
time: command terminated abnormally
        0.26 real         0.17 user         0.08 sys
           327397376  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               82700  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  82  involuntary context switches
          1311119871  instructions retired
           847893675  cycles elapsed
           321007616  peak memory footprint
Segmentation fault: 11
## obj
time: command terminated abnormally
        0.46 real         0.30 user         0.15 sys
           514150400  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              130856  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 199  involuntary context switches
          2446990455  instructions retired
          1517820450  cycles elapsed
           502243328  peak memory footprint
Segmentation fault: 11

@mgree
Copy link
Author

mgree commented May 1, 2024

Hey, I wanted to check in again. Would you like me to file correctness issues for these segfaults? Is there anything I can do to help with the review of this patch?

@wader wader requested a review from nicowilliams May 14, 2024 21:58
@wader wader force-pushed the configurable-parsing-depth branch 2 times, most recently from e3ba70a to 6c6e47f Compare August 24, 2024 15:16
@wader
Copy link
Member

wader commented Aug 24, 2024

@mgree force pushed rebase on master that fixes conflict in manual.yml

@itchyny @emanuele6 looks ok to merge? fixes #2846 that is on the 1.8 milestone. The only thing left for me would be if we should increase DEFAULT_MAX_PARSING_DEPTH while we're at it?

It's not sufficient to just add the state to the parser, since we
might call `jv_load_file` from a variety of other functions.

A few helper functions defer to `DEFAULT_MAX_PARSING_DEPTH`, as they
are given neither a `jq_state` nor `jv_parser`.

Added tests in `tests/shtest` to confirm that the maximum depth limit
is appropriately enforced.
@wader wader force-pushed the configurable-parsing-depth branch from 6c6e47f to a62f0d4 Compare November 25, 2024 13:39
@wader
Copy link
Member

wader commented Nov 25, 2024

Rebased and fix conflicts again

@wader
Copy link
Member

wader commented Nov 25, 2024

Hmm i wonder if --max-depth might be a clearer arg name?

@mgree
Copy link
Author

mgree commented Nov 25, 2024

I used --depth by analogy to find's -depth, but clarity is good! LMK if I can do anything to help.

@emanuele6
Copy link
Member

We need to touch up the JSON parser before doing a release anyway because of all the decNum related now-disclosed vulernabilities in the parser, so maybe we should hold off on merging a patch like this that may end up not being necessary

@emanuele6
Copy link
Member

But if you think it's useful to have regardless to set a limit, I guess that is fine.


Isn't the (GNU) find option that does something like this also called -max-depth though? ^^
-depth is about the order of iteration, and then I know that BSD has -depth n that is a check that is true only if the current path is at depth n

@mgree
Copy link
Author

mgree commented Nov 28, 2024

You're right---it's -maxdepth. I have literally zero feelings about what to call this option.

Whatever changes happen in the parser, there should be a way to control how it behaves on potentially expensive inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants